Angelic Hierarchical Planning: Optimal and Online Algorithms




Bhaskara  Marthi 
Stuart  J.  Russell 
Jason  Wolfe 


Electrical  Engineering  and  Computer  Sciences 
University  of  California  at  Berkeley 


Technical  Report  No.  UCB/EECS-2008-150 

http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-150.html 


December  6,  2008 




Copyright  2008,  by  the  author(s). 

All  rights  reserved. 

Permission  to  make  digital  or  hard  copies  of  all  or  part  of  this  work  for 
personal  or  classroom  use  is  granted  without  fee  provided  that  copies  are 
not  made  or  distributed  for  profit  or  commercial  advantage  and  that  copies 
bear  this  notice  and  the  full  citation  on  the  first  page.  To  copy  otherwise,  to 
republish,  to  post  on  servers  or  to  redistribute  to  lists,  requires  prior  specific 
permission. 


Acknowledgement 

Bhaskara Marthi thanks Leslie Kaelbling and Tomas Lozano-Perez for useful discussions. This research
was also supported by DARPA IPTO, contracts FA8750-05-2-0249 and FA8750-07-D-0185 (subcontract
03-000219).


Angelic  Hierarchical  Planning:  Optimal  and  Online  Algorithms 


Bhaskara  Marthi 

MIT/Willow  Garage  Inc. 

Stuart  Russell 

Computer  Science  Division,  University  of  California,  Berkeley,  CA  94720 

Jason  Wolfe* 

Computer  Science  Division,  University  of  California,  Berkeley,  CA  94720 


BHASKARA@CSAIL.MIT.EDU
RUSSELL@CS.BERKELEY.EDU
JAWOLFE@CS.BERKELEY.EDU


Abstract 

High-level  actions  (HLAs)  are  essential  tools  for  coping  with  the  large  search  spaces  and  long  decision 
horizons encountered in real-world decision making. In a recent paper, we proposed an “angelic” semantics
for  HLAs  that  supports  proofs  that  a  high-level  plan  will  (or  will  not)  achieve  a  goal,  without  first  reducing  the 
plan  to  primitive  action  sequences.  This  paper  extends  the  angelic  semantics  with  cost  information  to  support 
proofs  that  a  high-level  plan  is  (or  is  not)  optimal.  We  describe  the  Angelic  Hierarchical  A*  algorithm,  which 
generates  provably  optimal  plans,  and  show  its  advantages  over  alternative  algorithms.  We  also  present  the 
Angelic  Hierarchical  Learning  Real-Time  A*  algorithm  for  situated  agents,  one  of  the  first  algorithms  to  do 
hierarchical  lookahead  in  an  online  setting.  Since  high-level  plans  are  much  shorter,  this  algorithm  can  look 
much  farther  ahead  than  previous  algorithms  (and  thus  choose  much  better  actions)  for  a  given  amount  of 
computational  effort.  This  is  an  extended  version  of  a  paper  by  the  same  name  appearing  in  ICAPS  ’08. 


1.  Introduction 

Humans  somehow  manage  to  choose  quite  intelligently  the  twenty  trillion  primitive  motor  commands  that 
constitute  a  life,  despite  the  large  state  space.  It  has  long  been  thought  that  hierarchical  structure  in  behavior 
is  essential  in  managing  this  complexity.  Structure  exists  at  many  levels,  ranging  from  small  (hundred-step?) 
motor  programs  for  typing  characters  and  saying  phonemes  up  to  large  (billion-step?)  actions  such  as  writing 
an  ICAPS  paper,  getting  a  good  faculty  position,  and  so  on.  The  key  to  reducing  complexity  is  that  one  can 
choose  (correctly)  to  write  an  ICAPS  paper  without  first  considering  all  the  character  sequences  one  might 
type. 

Hierarchical  planning  attempts  to  capture  this  source  of  power.  It  has  a  rich  history  of  contributions  (to 
which  we  cannot  do  justice  here)  going  back  to  the  seminal  work  of  Tate  (1977).  The  basic  idea  is  to  supply 
a  planner  with  a  set  of  high-level  actions  (HLAs)  in  addition  to  the  primitive  actions.  Each  HLA  admits  one 
or  more  refinements  into  sequences  of  (possibly  high-level)  actions  that  implement  it.  Hierarchical  planners 
such as SHOP2 (Nau et al., 2003) usually consider only plans that are refinements of some top-level HLAs for
achieving  the  goal,  and  derive  power  from  constraints  placed  on  the  search  space  by  the  refinement  hierarchy. 

One might hope for more; consider, for example, the downward refinement property: every plan that
claims  to  achieve  some  condition  does  in  fact  have  a  primitive  refinement  that  achieves  it.  This  property  would 
enable  the  derivation  of  provably  correct  abstract  plans  without  refining  all  the  way  to  primitive  actions, 
providing potentially exponential speedups. This requires, however, that HLAs have clear precondition-effect
semantics, which have until recently been unavailable (McDermott, 2000). In a recent paper (Marthi et al.,
2007), henceforth (MRW ’07), we defined an “angelic semantics” for HLAs, specifying for each HLA the set of
states reachable by some refinement into a primitive action sequence. The angelic approach captures the fact
that the agent will choose a refinement and can thereby choose which element of an HLA’s reachable set is
actually reached. This semantics guarantees the downward refinement property and yields a sound and complete
hierarchical planning algorithm that derives significant speedups from its ability to generate and commit to
provably correct abstract plans.

*. The authors appear in alphabetical order.

Our  previous  paper  ignored  action  costs  and  hence  our  planning  algorithm  used  no  heuristic  information, 
a  mainstay  of  modern  planners.  The  first  objective  of  this  paper  is  to  rectify  this  omission.  The  angelic 
approach  suggests  the  obvious  extension:  the  exact  cost  of  executing  a  high-level  action  to  get  from  state  s  to 
state  s'  is  the  least  cost  among  all  primitive  refinements  that  reach  s'.  In  practice,  however,  representing  the 
exact  cost  of  an  HLA  from  each  state  s  to  each  reachable  state  s'  is  infeasible,  and  we  develop  concise  lower 
and  upper  bound  representations.  From  this  starting  point,  we  derive  the  first  algorithm  capable  of  generating 
provably  optimal  abstract  plans.  Conceptually,  this  algorithm  is  an  elaboration  of  A*,  applied  in  hierarchical 
plan  space  and  modified  to  handle  the  special  properties  of  refinement  operators  and  use  both  upper  and  lower 
bounds.  We  also  provide  a  satisficing  algorithm  that  sacrifices  optimality  for  computational  efficiency  and 
may  be  more  useful  in  practice.  Preliminary  experimental  results  show  that  these  algorithms  outperform  both 
“flat”  and  our  previous  hierarchical  approaches. 

The  paper  also  examines  HLAs  in  the  online  setting,  wherein  an  agent  performs  a  limited  lookahead  prior 
to  selecting  each  action.  The  value  of  lookahead  has  been  amply  demonstrated  in  domains  such  as  chess. 
We  believe  that  hierarchical  lookahead  with  HLAs  can  be  far  more  effective  because  it  brings  back  to  the 
present  value  information  from  far  into  the  future.  Put  simply,  it’s  better  to  evaluate  the  possible  outcomes 
of  writing  an  ICAPS  paper  than  the  possible  outcomes  of  choosing  “A”  as  its  first  character.  We  derive  an 
angelic  hierarchical  generalization  of  Korf’s  LRTA*  (1990),  which  shares  LRTA*’s  guarantees  of  eventual 
goal  achievement  on  each  trial  and  eventually  optimal  behavior  after  repeated  trials.  Experiments  show  that 
this  algorithm  substantially  outperforms  its  nonhierarchical  ancestor. 


2.  Background 

2.1  Planning  Problems 

Deterministic, fully observable planning problems can be described in a representation-independent manner
by a tuple $(S, s_0, t, \mathcal{L}, T, g)$, where $S$ is a set of states, $s_0$ is the initial state, $t$ is the
goal state,$^1$ $\mathcal{L}$ is a set of primitive actions, and $T : S \times \mathcal{L} \to S$ and
$g : S \times \mathcal{L} \to \mathbb{R}_+$ are transition and cost functions such that doing action $a$ in
state $s$ leads to state $T(s, a)$ with cost $g(s, a)$.$^2$ These functions are overloaded to operate on
sequences of actions in the obvious way: if $\mathbf{a} = (a_1, \ldots, a_m)$, then
$T(s, \mathbf{a}) = T(\ldots T(s, a_1) \ldots, a_m)$ and $g(s, \mathbf{a})$ is the total cost of this sequence.
The objective is to find a solution $\mathbf{a} \in \mathcal{L}^*$ for which $T(s_0, \mathbf{a}) = t$.

Definition 1. A solution $\mathbf{a}^*$ is optimal iff it reaches the goal with minimal cost:

$\mathbf{a}^* = \operatorname{argmin}_{\mathbf{a} \in \mathcal{L}^* : T(s_0, \mathbf{a}) = t} \; g(s_0, \mathbf{a})$.

We  assume  the  state  and  action  spaces  are  finite.  To  ensure  that  optimal  solutions  exist,  we  also  assume 
that  there  is  at  least  one  finite-cost  solution,  and  every  cycle  in  the  state  space  has  positive  cost.  In  this 
paper,  we  will  represent  S  as  the  set  of  truth  assignments  to  some  set  of  ground  propositions,  and  T  using  the 
STRIPS  language  (Fikes  and  Nilsson,  1971). 

As a running example, we introduce a simple “nav-switch” domain. This is a grid-world navigation domain
with locations represented by propositions $X(x)$ and $Y(y)$ for $x \in \{0, \ldots, x_{\max}\}$ and
$y \in \{0, \ldots, y_{\max}\}$, and actions $U$, $D$, $L$, and $R$ that move between them. There is a single
global “switch” that can face horizontally ($H$) or vertically ($\neg H$); move actions cost 2 if they go in
the current direction of the switch and 4 otherwise. The switch can be toggled by action $F$ with cost 1, but
only from a subset of designated squares. The goal is always to reach a particular square with minimum cost.
Since these goals correspond to 2 distinct states ($H$, $\neg H$), we add a dummy action $Z$ with cost 0 that
moves from these (pseudo-)goal states to the single terminal state $t$. For example, in a 2x2 problem
($x_{\max} = y_{\max} = 1$) where the switch can only be toggled from the top-left square (0, 0), if the
initial state $s_0$ is $X(1) \wedge Y(0) \wedge H$, the optimal plan to reach the bottom-left square (0, 1) is
$(L, F, D, Z)$ with cost 5.
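
The 2x2 instance above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the names `State`, `step`, and `run` are hypothetical, and returning an infinite cost for an inapplicable action is an assumption, since the text does not say how such actions are handled.

```python
from typing import NamedTuple

INF = float("inf")
TERMINAL = "t"  # single terminal state, reached via the dummy action Z


class State(NamedTuple):
    x: int
    y: int
    horizontal: bool  # True iff the switch faces horizontally (H)


X_MAX = Y_MAX = 1            # the 2x2 instance from the text
SWITCH_SQUARES = {(0, 0)}    # the switch can be toggled only at the top-left square
GOAL_SQUARE = (0, 1)         # bottom-left square; y grows downward
MOVES = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}


def step(s, a):
    """Return (T(s, a), g(s, a)). Assumption: inapplicable actions cost INF."""
    if s == TERMINAL:
        return s, INF
    if a in MOVES:
        dx, dy = MOVES[a]
        nx, ny = s.x + dx, s.y + dy
        if not (0 <= nx <= X_MAX and 0 <= ny <= Y_MAX):
            return s, INF
        # A move costs 2 if it goes in the switch's current direction, else 4.
        with_switch = (dx != 0) == s.horizontal
        return State(nx, ny, s.horizontal), 2 if with_switch else 4
    if a == "F" and (s.x, s.y) in SWITCH_SQUARES:
        return State(s.x, s.y, not s.horizontal), 1
    if a == "Z" and (s.x, s.y) == GOAL_SQUARE:
        return TERMINAL, 0
    return s, INF


def run(s, plan):
    """Overload T and g to action sequences: final state and total cost."""
    total = 0
    for a in plan:
        s, cost = step(s, a)
        total += cost
    return s, total
```

Under these assumptions, `run(State(1, 0, True), "LFDZ")` reaches the terminal state with total cost 2 + 1 + 2 + 0 = 5, whereas the plan without a switch flip, `"DLZ"`, costs 4 + 2 + 0 = 6.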

1. A problem with multiple goal states can easily be translated into an equivalent problem with a single goal state.

2. $\mathbb{R}_+$ denotes the set $\mathbb{R} \cup \{\infty\}$.




2.2  High-Level  Actions 

In addition to a planning problem, our algorithms will be given a set $\hat{\mathcal{A}}$ of high-level
actions, along with a set $\mathcal{I}(a)$ of allowed immediate refinements for each HLA
$a \in \hat{\mathcal{A}}$. Each immediate refinement consists of a finite sequence
$\mathbf{a} \in \tilde{\mathcal{A}}^*$, where we define $\tilde{\mathcal{A}} = \hat{\mathcal{A}} \cup \mathcal{L}$
as the set of all actions. Each HLA and refinement may have an associated precondition, which specifies
conditions under which its use is appropriate.$^3$ To make a high-level sequence more concrete we may refine
it, by replacing one of its HLAs by one of its immediate refinements, and we call one plan a refinement of
another if it is reachable by any sequence of such steps. A primitive refinement consists only of primitive
actions, and we define $\mathcal{I}^*(\mathbf{a}, s)$ as the set of all primitive refinements of $\mathbf{a}$
that obey all HLA and refinement preconditions when applied from state $s$. We assume no plan is a refinement
of itself. Finally, we assume a special top-level action $\mathrm{Act} \in \hat{\mathcal{A}}$, and restrict
our attention to plans in $\mathcal{I}^*(\mathrm{Act}, s_0)$.

Definition 2. (Parr and Russell, 1998) A plan $\mathbf{a}^{h*}$ is hierarchically optimal iff

$\mathbf{a}^{h*} = \operatorname{argmin}_{\mathbf{a} \in \mathcal{I}^*(\mathrm{Act}, s_0) : T(s_0, \mathbf{a}) = t} \; g(s_0, \mathbf{a})$.

Remark. Because the hierarchy may constrain the set of allowed sequences,
$g(s_0, \mathbf{a}^{h*}) \ge g(s_0, \mathbf{a}^*)$. When equality holds from all possible initial states, the
hierarchy is called optimality-preserving.

The hierarchy for our running example has three HLAs: $\hat{\mathcal{A}} = \{\mathrm{Nav}, \mathrm{Go}, \mathrm{Act}\}$.
Nav($x$, $y$) navigates directly to location $(x, y)$; it can refine to the empty sequence iff the agent is
already at $(x, y)$, and otherwise to any primitive move action followed by a recursive Nav($x$, $y$).
Go($x$, $y$) is like Nav, except that it may flip the switch on the way; it either refines to (Nav($x$, $y$)),
or to (Nav($x'$, $y'$), $F$, Go($x$, $y$)) where $(x', y')$ can access the switch. Finally, Act is the
top-level action, which refines to (Go($x_g$, $y_g$), $Z$), where $(x_g, y_g)$ is the goal location. This
hierarchy is optimality-preserving for any instance of the nav-switch domain.
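
As an illustration of the refinement sets just described, here is a minimal sketch. The tuple encoding of HLAs and the function name `immediate_refinements` are hypothetical, and the agent's position is passed explicitly because the Nav refinements depend on it.

```python
PRIMITIVE_MOVES = ("U", "D", "L", "R")


def immediate_refinements(hla, position, switch_squares, goal):
    """Return the immediate refinements of an HLA as a list of action tuples.

    HLAs are encoded as ("Act",), ("Go", x, y), or ("Nav", x, y); primitive
    actions are the strings "U", "D", "L", "R", "F", "Z". This encoding is
    an assumption made for illustration only.
    """
    kind = hla[0]
    if kind == "Act":
        # Act refines to (Go(x_g, y_g), Z).
        return [(("Go",) + tuple(goal), "Z")]
    _, x, y = hla
    if kind == "Nav":
        if position == (x, y):
            return [()]  # empty sequence iff the agent is already at (x, y)
        # Otherwise: any primitive move followed by a recursive Nav(x, y).
        return [(move, ("Nav", x, y)) for move in PRIMITIVE_MOVES]
    if kind == "Go":
        direct = [(("Nav", x, y),)]
        via_switch = [
            (("Nav", sx, sy), "F", ("Go", x, y))
            for (sx, sy) in sorted(switch_squares)
        ]
        return direct + via_switch
    raise ValueError(f"unknown HLA: {hla!r}")
```

For the 2x2 instance, `immediate_refinements(("Go", 0, 1), (1, 0), {(0, 0)}, (0, 1))` yields the direct refinement (Nav(0, 1)) and the switch-flipping refinement (Nav(0, 0), F, Go(0, 1)).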


3.  Cost-Based  Descriptions  of  HLAs 

As  mentioned  in  the  introduction,  our  angelic  semantics  (MRW  ’07)  describes  the  outcome  of  a  high-level 
plan  by  its  reachable  set  of  states  (by  some  refinement).  However,  these  reachable  sets  say  nothing  about 
costs  incurred  along  the  way.  This  section  describes  a  novel  extension  of  the  angelic  approach  that  includes 
cost  information.  This  will  allow  us  to  find  good  plans  quickly  by  focusing  on  better-seeming  plans  first,  and 
pruning  provably  suboptimal  high-level  plans  without  refining  them  further. 

We begin with the notion of an exact description $E_a$ of an HLA $a$, which specifies, for each pair of states
$(s, s')$, the minimum cost of any primitive refinement of $a$ that leads from $s$ to $s'$ (this generalizes
the original definition from (MRW ’07)).

Definition 3. The exact description of HLA $a$ is a function
$E_a(s)(s') = \min_{\mathbf{b} \in \mathcal{I}^*(a, s) : T(s, \mathbf{b}) = s'} \; g(s, \mathbf{b})$.

Remark.  Note  that  the  set  of  primitive  refinements  may  be  infinite.  The  minimum  must  still  be  attained, 
however,  due  to  the  finiteness  and  positive-cycle  assumptions. 

Remark. Definition 3 implies that if $s'$ is not reachable from $s$ by any refinement of $a$, $E_a(s)(s') = \infty$.

Definition 4. A valuation is a function $v : S \to \mathbb{R}_+$. The initial valuation $v_0$ has
$v_0(s_0) = 0$ and $v_0(s) = \infty$ for all $s \neq s_0$.

We can think of descriptions as functions from states to valuations that specify a reachable set plus a finite
cost for each reachable state (see Figure 1(b)). Then, descriptions can be extended to functions from
valuations to valuations, by defining $E_a(v)(s') = \min_{s \in S} \; v(s) + E_a(s)(s')$. Finally, these
extended descriptions can be composed to produce descriptions for high-level sequences.

3.  We  treat  these  preconditions  as  advisory,  so  for  our  purposes  a  planning  algorithm  is  complete  even  if  it  takes  them  into  account, 
and sound even if it ignores them.

Definition 5. Given a sequence $\mathbf{a} = (a_1, \ldots, a_N)$, the exact transition function of $\mathbf{a}$
is a function mapping valuations to valuations: $E_{\mathbf{a}} = E_{a_N} \circ \cdots \circ E_{a_1}$.
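
The extension of a description to valuations is a min-plus update, and composing such updates gives the transition function of a sequence. The sketch below illustrates this on an explicitly enumerated toy state space; the names `extend` and `compose` and the three-state example are hypothetical, and a description is assumed to be a Python function mapping a state to a dictionary of finite costs.

```python
INF = float("inf")


def extend(description, states):
    """Lift a description s -> {s': cost} to a map from valuations to valuations.

    Implements E_a(v)(s') = min over s of v(s) + E_a(s)(s'); successor
    states missing from description(s) are treated as having cost INF.
    """
    def progress(v):
        out = {s: INF for s in states}
        for s, cost_so_far in v.items():
            if cost_so_far == INF:
                continue  # unreachable states contribute nothing
            for s2, c in description(s).items():
                out[s2] = min(out[s2], cost_so_far + c)
        return out
    return progress


def compose(descriptions, states):
    """Transition function of a sequence (a_1, ..., a_N): apply a_1 first."""
    def progress(v):
        for d in descriptions:
            v = extend(d, states)(v)
        return v
    return progress


# Toy example (hypothetical): three states and two HLA descriptions.
STATES = ["A", "B", "C"]
d1 = lambda s: {"B": 2, "C": 10} if s == "A" else {}
d2 = lambda s: {"B": {"C": 3}, "C": {"C": 0}}.get(s, {})
v0 = {"A": 0, "B": INF, "C": INF}
result = compose([d1, d2], STATES)(v0)
```

In this toy example, state C is reached either through B with cost 2 + 3 or directly with cost 10 + 0, so the composed valuation assigns C the cost 5 and leaves A and B unreachable.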


Theorem 1. For any integer $N$, final state $s_N$, and action sequence $\mathbf{a} \in \tilde{\mathcal{A}}^N$,
the minimum over all state sequences $(s_1, \ldots, s_{N-1})$ of total cost
$\sum_{i=1}^{N} E_{a_i}(s_{i-1})(s_i)$ equals $E_{\mathbf{a}}(v_0)(s_N)$. Moreover, for any such minimizing
state sequence, concatenating the primitive refinements of each HLA $a_i$ that achieve the minimum cost
$E_{a_i}(s_{i-1})(s_i)$ for each step yields a primitive refinement of $\mathbf{a}$ that reaches $s_N$ from
$s_0$ with minimal cost.

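Proof. The proof is by induction. When $N = 1$, the theorem follows trivially from Definitions 3 and 4.
When $N > 1$,

\begin{align*}
\min_{(s_1, \ldots, s_{N-1})} \sum_{i=1}^{N} E_{a_i}(s_{i-1})(s_i)
  &= \min_{(s_1, \ldots, s_{N-1})} \left( E_{a_N}(s_{N-1})(s_N) + \sum_{i=1}^{N-1} E_{a_i}(s_{i-1})(s_i) \right) \\
  &= \min_{s_{N-1}} \left( E_{a_N}(s_{N-1})(s_N) + \min_{(s_1, \ldots, s_{N-2})} \sum_{i=1}^{N-1} E_{a_i}(s_{i-1})(s_i) \right) \\
  &= \min_{s_{N-1}} \left( E_{a_N}(s_{N-1})(s_N) + E_{a_{N-1}} \circ \cdots \circ E_{a_1}(v_0)(s_{N-1}) \right) \\
  &= E_{a_N} \circ \cdots \circ E_{a_1}(v_0)(s_N)
\end{align*}

□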

By this theorem, an efficient, compact representation for $E_a$ would (under mild conditions) lead to an
efficient optimal planning algorithm. Unfortunately, since deciding even simple plan existence is PSPACE-hard
(Bylander, 1994), we cannot hope for this in general. We will therefore consider principled compact
approximations to $E_a$ that still allow for precise inferences about the effects and costs of high-level plans.

3.1  Optimistic  and  Pessimistic  Bounds  on  Descriptions 

Definition 6. A valuation $v_1$ (weakly) dominates another valuation $v_2$, written $v_1 \preceq v_2$, iff
$(\forall s \in S)\; v_1(s) \le v_2(s)$.

Definition 7. An optimistic description $O_a$ of HLA $a$ satisfies $(\forall s)\; O_a(s) \preceq E_a(s)$.

For example, our optimistic description of Go (see Figure 1(a/c)) specifies that the cost for getting to
the target location (possibly flipping the switch on the way) is at least twice its Manhattan distance from the
current location; moreover, all other states are unreachable by Go.

Definition 8. A pessimistic description $P_a$ of HLA $a$ satisfies $(\forall s)\; E_a(s) \preceq P_a(s)$.

For  example,  our  pessimistic  description  of  Go  specifies  that  the  cost  to  reach  the  destination  is  at  most 
four  times  its  Manhattan  distance  from  the  current  location. 
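
As a small worked instance of these two bounds, the sketch below computes the optimistic and pessimistic costs of Go from the Manhattan distance. The function names are hypothetical; the factors 2 and 4 are the ones stated in the text.

```python
def manhattan(source, target):
    """Manhattan distance between two grid squares given as (x, y) pairs."""
    return abs(target[0] - source[0]) + abs(target[1] - source[1])


def go_cost_bounds(source, target):
    """Return (optimistic, pessimistic) cost bounds for Go(target) from source.

    Optimistic: every move costs at least 2 (each goes with the switch).
    Pessimistic: every move costs at most 4 (each goes against the switch).
    """
    d = manhattan(source, target)
    return 2 * d, 4 * d
```

For the running example, `go_cost_bounds((1, 0), (0, 1))` gives (4, 8). These bracket the cost 5 of the refinement (L, F, D), which ends with the switch vertical, and the cost 6 of (D, L), which ends with the switch horizontal.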

Remark. For primitive actions $a \in \mathcal{L}$, $O_a(s)(s') = P_a(s)(s') = g(s, a)$ if $s' = T(s, a)$, and $\infty$ otherwise.

Optimistic  and  pessimistic  descriptions  generalize  our  previous  complete  and  sound  descriptions  (MRW 
’07).  In  this  paper,  we  will  assume  that  the  descriptions  are  given  along  with  the  hierarchy.  However,  we  note 
that  it  is  theoretically  possible  to  derive  them  automatically  from  the  structure  of  the  hierarchy. 

As  with  exact  descriptions,  we  can  extend  optimistic  and  pessimistic  descriptions  and  then  compose 
them  to  produce  bounds  on  the  outcomes  of  high-level  sequences,  which  we  call  optimistic  and  pessimistic 
valuations  (see  Figure  1  (c/d)). 

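Theorem 2. Given any sequence $\mathbf{a} \in \tilde{\mathcal{A}}^N$ and state $s$, the cost
$c = \min_{\mathbf{b} \in \mathcal{I}^*(\mathbf{a}, s_0) : T(s_0, \mathbf{b}) = s} \; g(s_0, \mathbf{b})$ of the
best primitive refinement of $\mathbf{a}$ that reaches $s$ from $s_0$ satisfies
$O_{a_N} \circ \cdots \circ O_{a_1}(v_0)(s) \le c \le P_{a_N} \circ \cdots \circ P_{a_1}(v_0)(s)$.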




(a) Properties of HLA Go($x_t$, $y_t$) (precondition $X(x_s) \wedge Y(y_s)$)

refs:         (Nav($x_t$, $y_t$))
              (Nav($x$, $y$), $F$, Go($x_t$, $y_t$))  $(\forall x, y)$ s.t. a switch at $(x, y)$

optimistic:   $-X(x_s)$, $-Y(y_s)$, $+X(x_t)$, $+Y(y_t)$, $\tilde{\pm}H$
              cost $\ge 2 \cdot (|x_t - x_s| + |y_t - y_s|)$

pessimistic:  $-X(x_s)$, $-Y(y_s)$, $+X(x_t)$, $+Y(y_t)$
              cost $\le 4 \cdot (|x_t - x_s| + |y_t - y_s|)$

Figure 1: Some examples taken from our example nav-switch problem. (a) Refinements and NCSTRIPS
descriptions of the Go HLA. (b) Exact valuation from $s_0$ for Go(0, 1). Gray rounded rectangles
represent the state space; in the top four states (circles) the switch is horizontal, and in the bottom
four it is vertical. Each arrow represents a primitive refinement of Go(0, 1); the cost assigned to
each state is the min cost of any refinement that reaches it. The exact reachable set corresponding
to this HLA is also outlined. (c) Optimistic simple valuation
$X(0) \wedge \neg X(1) \wedge \neg Y(0) \wedge Y(1) : 4$ for the example in (b), as would be produced by the
description in (a). (d) Pessimistic simple valuation
$X(0) \wedge \neg X(1) \wedge \neg Y(0) \wedge Y(1) \wedge H : 8$.


Proof. The theorem is equivalent to the assertion that
$O_{a_N} \circ \cdots \circ O_{a_1}(v_0) \preceq E_{a_N} \circ \cdots \circ E_{a_1}(v_0) \preceq P_{a_N} \circ \cdots \circ P_{a_1}(v_0)$.
When $N = 1$, this follows trivially from Definitions 7 and 8. When $N > 1$, for optimistic descriptions
(the pessimistic case is symmetric):

\begin{align*}
O_{a_N} \circ \cdots \circ O_{a_1}(v_0)(s_N)
  &= \min_{s_{N-1}} \; O_{a_N}(s_{N-1})(s_N) + O_{a_{N-1}} \circ \cdots \circ O_{a_1}(v_0)(s_{N-1}) \\
  &\le \min_{s_{N-1}} \; E_{a_N}(s_{N-1})(s_N) + E_{a_{N-1}} \circ \cdots \circ E_{a_1}(v_0)(s_{N-1}) \\
  &= E_{a_N} \circ \cdots \circ E_{a_1}(v_0)(s_N)
\end{align*}

□

Moreover,  following  Theorem  1,  these  are  the  tightest  bounds  derivable  from  a  set  of  optimistic  and 
pessimistic  descriptions. 

The reader might wonder what descriptions are appropriate for Act. Since the agent cannot stop acting
until it reaches the goal state, Act’s pessimistic descriptions cannot assign finite cost to any outcome other
than  t.  Moreover,  the  optimistic  cost  to  t  for  Act  will  be  our  normal  notion  of  an  admissible  heuristic,  which 
could  be  automatically  derived  from  a  relaxed  version  of  the  problem  (e.g.,  a  planning  graph). 

3.2  Representing  and  Reasoning  with  Descriptions 

Whereas  the  results  presented  thus  far  are  representation-independent,  to  utilize  them  effectively  we  require 
compact  representations  for  valuations  and  descriptions  as  well  as  efficient  algorithms  for  operating  on  these 
representations. 

In particular, we consider simple valuations of the form $\sigma : c$ where $\sigma \subseteq S$ and
$c \in \mathbb{R}_+$, which specify a reachable set of states along with a single numeric bound on the cost to
reach states in this set (all other states are assigned cost $\infty$). As exemplified in Figure 1(c/d), an
optimistic simple valuation asserts that states in $\sigma$ may be reachable with cost at least $c$, and other
states are unreachable; likewise, a pessimistic simple valuation asserts that each state in $\sigma$ is
reachable with cost at most $c$, and other states may be reachable as well.$^4$

Simple valuations are convenient, since we can reuse our previous machinery (MRW ’07) for reasoning
with reachable sets represented as DNF (disjunctive normal form) logical formulae and HLA descriptions
specified in a language called NCSTRIPS (Nondeterministic Conditional STRIPS). NCSTRIPS is an extension
of ordinary STRIPS that can express a set of possible effects with mutually exclusive conditions. Each
effect consists of four lists of propositions: add ($+$), delete ($-$), possibly-add ($\tilde{+}$), and
possibly-delete ($\tilde{-}$). Added propositions are always made true in the resulting state, whereas
possibly-added propositions may or may not be made true; in a pessimistic description, the agent can force
either outcome, whereas in an optimistic one the outcome may not be controllable. By extending NCSTRIPS with
cost bounds (which can be computed by arbitrary code), we produce descriptions suitable for the approach taken
here. Figure 1(a) shows possible descriptions for Go in this extended language (as is typically the case,
these descriptions could be made more accurate at the expense of conciseness by conditioning on features of
the initial state).

With these representational choices, we require an algorithm for progressing a simple valuation
represented as a DNF reachable set plus numeric cost bound through an extended NCSTRIPS description. If
we  perform  this  progression  exactly,  the  output  may  not  be  a  simple  valuation  (since  different  states  in  the 
reachable  set  may  produce  different  cost  bounds).  Thus,  we  will  instead  consider  an  approximate  progression 
algorithm  that  projects  results  back  into  the  space  of  simple  valuations.  Applying  this  algorithm  repeatedly 
will  allow  us  to  compute  optimistic  and  pessimistic  simple  valuations  for  entire  high-level  sequences. 

The  algorithm  is  a  simple  extension  of  that  given  in  (MRW  ’07),  which  progresses  each  (conjunctive 
clause,  conditional  effect)  pair  separately  and  then  disjoins  the  results.  This  progression  proceeds  by  (1) 
conjoining  effect  conditions  onto  the  clause  (and  skipping  this  clause  if  a  contradiction  is  created),  (2)  making 
all  added  (resp.  deleted)  literals  true  (resp.  false),  and  finally  (3)  removing  literals  from  the  clause  if  false 
(resp.  true)  and  possibly-added  (resp.  possibly-deleted).  With  our  extended  NCSTRIPS  descriptions,  each 
(clause,  effect)  pair  also  produces  a  cost  bound.  When  progressing  optimistic  (resp.  pessimistic)  valuations, 
we  simply  take  the  min  (resp.  max)  of  all  these  bounds  plus  the  initial  bound  to  get  the  cost  bound  for  the 
final  valuation.5 
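
The three progression steps and the min/max rule for cost bounds can be sketched as follows. This is an illustrative reading of the paragraph above, not the authors' code: the `Effect` class and function names are hypothetical, a clause is assumed to be a dictionary from proposition names to truth values, and each effect is assumed to carry an already-computed numeric cost bound.

```python
from dataclasses import dataclass, field

INF = float("inf")


@dataclass
class Effect:
    """One conditional effect of an extended NCSTRIPS description (hypothetical)."""
    condition: dict = field(default_factory=dict)
    add: frozenset = frozenset()
    delete: frozenset = frozenset()
    possibly_add: frozenset = frozenset()
    possibly_delete: frozenset = frozenset()
    cost: float = 0.0  # cost bound for this effect, assumed precomputed


def progress_clause(clause, effect):
    """Progress one conjunctive clause through one effect; None on contradiction."""
    new = dict(clause)
    # (1) Conjoin the effect condition onto the clause.
    for prop, value in effect.condition.items():
        if new.get(prop, value) != value:
            return None
        new[prop] = value
    # (2) Added literals become true, deleted literals become false.
    for prop in effect.add:
        new[prop] = True
    for prop in effect.delete:
        new[prop] = False
    # (3) Drop literals whose value may change: false and possibly-added,
    #     or true and possibly-deleted.
    for prop in effect.possibly_add:
        if new.get(prop) is False:
            del new[prop]
    for prop in effect.possibly_delete:
        if new.get(prop) is True:
            del new[prop]
    return new


def progress_valuation(dnf, bound, effects, optimistic):
    """Progress a simple valuation (DNF clauses, cost bound) through a description."""
    clauses, costs = [], []
    for clause in dnf:
        for effect in effects:
            result = progress_clause(clause, effect)
            if result is not None:
                clauses.append(result)
                costs.append(effect.cost)
    if not clauses:
        return [], INF
    pick = min if optimistic else max
    return clauses, bound + pick(costs)
```

Applied to the initial clause $X(1) \wedge Y(0) \wedge H$ (with $X(0)$ and $Y(1)$ false) and the optimistic Go(0, 1) effect from Figure 1(a) with cost bound 4, this sketch drops the $H$ literal and yields the clause and bound shown in Figure 1(c); with the pessimistic effect and cost bound 8 it keeps $H$, as in Figure 1(d).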

Our above definitions need some minor modifications to allow for such approximate progression algorithms.
For simplicity, we will absorb any additional approximation into our notation for the descriptions
themselves:

Definition 9. An approximate progression algorithm corresponds to, for each extended optimistic and
pessimistic description $O_a$ and $P_a$, (further) approximated descriptions $\tilde{O}_a$ and $\tilde{P}_a$.
Call the algorithm correct if, for all actions $a$ and valuations $v$, $\tilde{O}_a(v) \preceq O_a(v)$ and
$P_a(v) \preceq \tilde{P}_a(v)$.

Intuitively,  a  progression  algorithm  is  correct  as  long  as  the  errors  it  introduces  only  further  weaken  the 
descriptions. 

Theorem 3. Theorem 2 still holds if we use any correct approximate progression algorithm, replacing each
$O_a$ and $P_a$ with its further approximated counterpart $\tilde{O}_a$ and $\tilde{P}_a$.

The  proof  is  similar  to  that  of  Theorem  2. 


4.  Offline  Search  Algorithms 

This section describes algorithms for the offline planning setting, in which the objective is to quickly find a
low-cost sequence of actions leading all the way from $s_0$ to $t$.

4.  More  interesting  tractable  classes  of  valuations  are  possible;  for  instance,  rather  than  using  a  single  numeric  bound,  we  could  allow 
linear  combinations  of  indicator  functions  on  state  variables. 

5.  A  more  accurate  algorithm  for  pessimistic  progression  sorts  the  clauses  by  increasing  pessimistic  cost,  computes  the  minimal  prefix 
of  this  list  whose  disjunction  covers  all  of  the  remaining  clauses,  and  then  restricts  the  max  over  cost  bounds  to  clauses  in  this  prefix. 
We  did  not  implement  this  version,  since  it  requires  many  potentially  expensive  subsumption  checks. 




Figure 2: (a) A standard lookahead tree for our example. Nodes are labeled with states (written $s_{xy(h/v)}$)
and costs-so-far, edges are labeled with actions and associated costs, and leaves have a heuristic
estimate of the remaining distance-to-goal. (b) An abstract lookahead tree (ALT) for our example.
Nodes are labeled with optimistic and pessimistic simple valuations and edges are labeled with
(possibly high-level) actions and associated optimistic and pessimistic costs.


Because we have models for our HLAs, our planning algorithms will resemble existing algorithms that
search over primitive action sequences. Such algorithms typically operate by building a lookahead tree
(see Figure 2(a)). The initial tree consists of a single node labeled with the initial state and cost 0, and
computations consist of leaf node expansions: for each primitive action a, we add an outgoing edge labeled
with that action and its cost g(s, a), whose child is labeled with the state s' = T(s, a) and total cost to s'. We
also include at leaf nodes a heuristic estimate h(s') of the remaining cost to the goal. Paths from the root to a
leaf are potential plans; for each such plan a, we estimate the total cost of its best continuation by
f(s_0, a) = g(s_0, a) + h(T(s_0, a)), the sum of its cost and heuristic value. If the heuristic h never overestimates, we
call it admissible, and this f-cost will also never overestimate. If h also obeys the triangle inequality
h(s) ≤ g(s, a) + h(T(s, a)), we call it consistent, and expanding a node will always produce extensions with greater
or equal f-cost. These properties are required for A* and its graph version (respectively) to efficiently find
optimal plans.
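
As a minimal illustration of these definitions, the following Python sketch computes f-costs and checks the consistency inequality on a small hand-made graph. The graph, the two heuristic tables, and the names f_cost and is_consistent are assumptions made for this example only; they are not taken from the implementation described in this report.

```python
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
# Hypothetical toy problem: edges[s] lists (action, successor, step cost).
Edges = Dict[State, List[Tuple[str, State, float]]]


def f_cost(g_so_far: float, h: Callable[[State], float], s: State) -> float:
    """f = g + h for a plan with cost g_so_far that currently ends in state s."""
    return g_so_far + h(s)


def is_consistent(edges: Edges, h: Callable[[State], float]) -> bool:
    """Check h(s) <= g(s, a) + h(T(s, a)) on every edge of the toy graph."""
    return all(
        h(s) <= cost + h(s2)
        for s, outgoing in edges.items()
        for _, s2, cost in outgoing
    )


edges: Edges = {
    "A": [("a1", "B", 1.0), ("a2", "C", 4.0)],
    "B": [("a3", "C", 2.0)],
    "C": [],  # "C" plays the role of the goal
}
h_exact = {"A": 3.0, "B": 2.0, "C": 0.0}  # true remaining costs: consistent
h_loose = {"A": 3.0, "B": 0.5, "C": 0.0}  # admissible, but violates the inequality on A -> B

print(is_consistent(edges, h_exact.__getitem__))  # True
print(is_consistent(edges, h_loose.__getitem__))  # False
# With h_loose, extending the empty plan by a1 lowers the f-cost from 3.0 to 1.5.
print(f_cost(0.0, h_loose.__getitem__, "A"), f_cost(1.0, h_loose.__getitem__, "B"))
```

The second heuristic never overestimates, yet the f-cost drops after an expansion, which is exactly the behavior that consistency rules out.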

In  hierarchical  planning  we  will  consider  algorithms  that  build  abstract  lookahead  trees  (ALTs).  In 
an  ALT,  edges  are  labeled  with  (possibly  high-level)  actions  and  nodes  are  labeled  with  optimistic  and 
pessimistic valuations for corresponding partial plans. For example, in the ALT in Figure 2(b), by doing
(Nav(0, 0), F, Go(0, 1)), state s_01v is definitely reachable with cost in [5, 7], s_01h may be reachable with
cost at least 5, and no other states are possibly reachable. Since our planning algorithms will try to find
low-cost solutions, we will be most concerned with finding optimistic (and pessimistic) bounds on the cost
of the best primitive refinement of each high-level plan that reaches t. These bounds can be extracted
directly from the final ALT node of each plan; for instance, the optimistic and pessimistic costs to t of plan
(Nav(0, 0), F, Go(0, 1), Z) are [5, 7].

In a generalization of the ordinary notion of consistency, we will sometimes desire consistent HLA
descriptions, under which we never lose information by refining.⁶ As in the flat case, when descriptions are
consistent, the optimistic cost to t (i.e., f-cost) of a plan will never decrease with further refinement.
Similarly, its best pessimistic cost will never increase.

6.  Specifically,  a  set  of  optimistic  descriptions  (plus  approximate  progression  algorithm,  if  applicable)  is  consistent  iff,  when  we 
refine  any  high-level  plan,  its  optimistic  valuation  dominates  the  optimistic  valuations  of  its  refinements.  A  set  of  pessimistic 
descriptions  (plus  progression  algorithm)  is  consistent  iff  the  state-wise  minimum  of  a  set  of  refinements’  pessimistic  valuations 
always  dominates  the  pessimistic  valuation  of  the  parent  plan. 

We first describe our ALT data structures and how they address some of the issues that arise in our
hierarchical  planning  framework  in  novel  ways.  We  then  present  our  optimal  planning  algorithm,  AHA*,  and 
briefly  describe  an  alternative  “satisficing”  algorithm,  AHSS. 

4.1  Abstract  Lookahead  Trees 

Our  ALT  data  structures  support  our  search  algorithms  by  efficiently  managing  a  set  of  candidate  high-level 
plans  and  associated  valuations.  The  issues  involved  differ  from  the  primitive  setting  because  nodes  store 
valuations  rather  than  single  states  and  exact  costs,  and  because  (unlike  node  expansion)  plan  refinement  is 
“top-down” and may not correspond to simple extensions of existing plans.

Algorithm  1  shows  pseudocode  for  some  basic  ALT  operations.  Our  search  algorithms  work  by  first 
creating  an  ALT  containing  some  initial  set  of  plans  using  MakeInitialALT,  and  then  repeatedly  refining 
candidate  plans  using  RefinePlanEdge,  which  only  considers  refinements  whose  preconditions  are  met 
by  at  least  one  state  in  the  corresponding  optimistic  reachable  set.  Both  operations  internally  call  AddPlan, 
which  adds  a  plan  to  the  ALT  by  starting  at  the  existing  node  corresponding  to  the  longest  prefix  shared  with 
any  existing  plan,  and  creating  nodes  for  the  remaining  plan  suffix  by  progressing  its  valuations  through  the 
corresponding  action  descriptions.  In  the  process,  partial  plans  that  are  provably  dominated  and  plans  that 
cannot  possibly  reach  the  goal  are  recognized  and  skipped  over. 
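
The prefix-sharing behavior of AddPlan can be sketched in Python as follows. This is a deliberately simplified illustration, not the implementation described here: each action is assumed to have a fixed, state-independent cost interval (the hypothetical BOUNDS table), reachable sets are not tracked, and no pruning is performed. The names Node and add_plan are likewise assumptions of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence, Tuple

# Hypothetical per-action (optimistic, pessimistic) cost bounds. Real HLA
# descriptions would also progress reachable sets, which this sketch omits.
BOUNDS: Dict[str, Tuple[float, float]] = {
    "Nav": (2.0, 4.0),
    "F": (1.0, 1.0),
    "Go": (2.0, 5.0),
    "L": (1.0, 1.0),
}


@dataclass
class Node:
    opt: float = 0.0   # optimistic cost-so-far
    pess: float = 0.0  # pessimistic cost-so-far
    children: Dict[str, "Node"] = field(default_factory=dict)


def add_plan(root: Node, plan: Sequence[str]) -> int:
    """Add a plan to the tree, reusing the longest prefix already present.

    Returns the number of nodes that had to be created.
    """
    node, created = root, 0
    for action in plan:
        if action not in node.children:
            lo, hi = BOUNDS[action]
            node.children[action] = Node(node.opt + lo, node.pess + hi)
            created += 1
        node = node.children[action]
    return created


root = Node()
print(add_plan(root, ["Nav", "F", "Go"]))  # 3: every node is new
print(add_plan(root, ["Nav", "F", "L"]))   # 1: the prefix (Nav, F) is shared
```

Because the second plan reuses the nodes for its shared prefix, only its suffix requires progression, which is the caching benefit the ALT is meant to provide.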

Theorem 4. If a node n with optimistic valuation O(n) is created while adding plan a, and another node
n' exists with pessimistic valuation P(n') s.t. P(n') ⪯ O(n) (i.e., P(n') dominates O(n)) and the remaining plan suffix of a is a legal
hierarchical continuation from n', then a is safely prunable.

Proof. We must show that if any primitive refinement of a is hierarchically optimal, then there exists a
primitive refinement of a plan passing through n' that is hierarchically optimal as well (and thus we don't
lose hierarchical optimality by pruning a). Suppose that b ∈ I*(s_0, a) is hierarchically optimal with cost
c. Decompose a into a_1, the set of actions leading up to node n, and a_2, the remainder of the actions in
a. Decompose b similarly, so that b_1 ∈ I*(s_0, a_1) and b_2 ∈ I*(s, a_2), where s = T(s_0, b_1) and, by
hierarchical optimality of b, T(s, b_2) = t. Let c_1 = g(s_0, b_1) and c_2 = g(s, b_2), so that c = c_1 + c_2. Now,
by the definition of optimistic descriptions, we must have O(n)(s) ≤ c_1. Let the plan c be the sequence of actions
leading up to n'. Because P(n') ⪯ O(n), we must have P(n')(s) ≤ c_1. Thus, by definition of pessimistic
descriptions, there exists d ∈ I*(s_0, c) such that T(s_0, d) = s and g(s_0, d) ≤ c_1. Finally, since a_2 is
an allowed continuation from n', concatenating d and b_2 yields a primitive plan that reaches t, is a valid
hierarchical primitive refinement of a plan passing through n', and has total cost ≤ c_1 + c_2. Thus, either b was not
hierarchically optimal in the first place, or this new refinement of a plan passing through n' is hierarchically
optimal as well. □

Remark. The continuation condition is needed since the hierarchy might allow better continuations from
node n than n'.

For example, the plan (L, R, Nav(0, 1), Z) in Figure 2(b) is prunable since its optimistic valuation is
dominated by the pessimistic valuation above it, and the empty continuation is allowed from that node. Since
detecting all pruned nodes can be very expensive, our implementation only considers pruning for nodes with
singleton reachable sets.
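
The pruning test of Theorem 4, restricted to singleton optimistic sets, might look as follows in Python. The representation is an assumption of this sketch: a valuation is a dictionary from states to cost bounds, with absent states treated as unreachable (cost +infinity). The function names and the state labels are hypothetical, and the check of whether the remaining suffix is a legal continuation is passed in as a flag rather than computed.

```python
import math
from typing import Dict, Hashable

Valuation = Dict[Hashable, float]  # state -> cost bound; absent states cost +inf


def dominates(pess: Valuation, opt: Valuation) -> bool:
    """True iff pess(s) <= opt(s) for every state s.

    States absent from opt have optimistic cost +inf, so only the states
    listed in opt need to be checked.
    """
    return all(pess.get(s, math.inf) <= c for s, c in opt.items())


def prunable(opt_new: Valuation, pess_existing: Valuation,
             suffix_allowed: bool) -> bool:
    """Theorem 4 test, applied only when the new optimistic set is a singleton."""
    return (
        len(opt_new) == 1
        and suffix_allowed
        and dominates(pess_existing, opt_new)
    )


# A new node optimistically reaches s01v with cost at least 6, while an
# existing node is guaranteed to reach s01v with cost at most 5.
print(prunable({"s01v": 6.0}, {"s01v": 5.0}, suffix_allowed=True))   # True
print(prunable({"s01v": 6.0}, {"s01v": 7.0}, suffix_allowed=True))   # False
print(prunable({"s01v": 6.0}, {"s01v": 5.0}, suffix_allowed=False))  # False
```

The last call illustrates the role of the continuation condition: domination alone is not sufficient to prune.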

One might wonder why RefinePlanEdge refines a single plan at a given HLA edge, rather than
simultaneously refining all plans that pass through it. The reason is that after each refinement of the HLA, it would
have  to  continue  progression  for  each  such  plan’s  suffix.  This  could  be  needlessly  expensive,  especially  if 
some  such  plans  are  already  thought  to  be  bad. 

In  any  case,  when  valuations  are  simple,  we  can  use  a  novel  improvement  called  upward  propagation 
(implemented  in  RefinePlanEdge)  to  propagate  new  information  about  the  cost  of  a  refined  HLA  edge 
to  other  plans  that  pass  through  it,  without  having  to  explicitly  refine  them  or  do  any  additional  progression. 
This  improvement  hinges  on  the  fact  that  with  simple  valuations,  the  optimistic  and  pessimistic  costs  for  a 
plan  can  be  broken  down  into  optimistic  and  pessimistic  costs  for  each  action  in  that  plan  (see  Figure  2(b)). 


Algorithm 1: Abstract lookahead tree (ALT) operations

function ADDPLAN(n, (a_1, ..., a_k))
    for i from 1 to k do
        if node n[a_i] does not exist then
            create n[a_i] from n and the descriptions of a_i
            if n[a_i] is prunable via Theorem 4 then return
        n ← n[a_i]
    if O(n)(t) < ∞ then mark n as a valid refinable plan

function MAKEINITIALALT(s_0, plans)
    root ← a new node with O(root) = P(root) = v_0
    for each plan ∈ plans do ADDPLAN(root, plan)
    return root

function REFINEPLANEDGE(root, (a_1, ..., a_k), i)
    mark node root[a_1]...[a_k] as refined
    for (b_1, ..., b_j) ∈ I(a_i) w/ preconditions met by some O(root[a_1]...[a_{i-1}]) do
        ADDPLAN(root, (a_1, ..., a_{i-1}, b_1, ..., b_j, a_{i+1}, ..., a_k))
    (o, p) ← (min, max) of the (optimistic, pessimistic) costs of a_i's refs
    a_i's optimistic cost ← max(current value, o)     /* upward */
    a_i's pessimistic cost ← min(current value, p)    /* propagation */



Theorem 5. The min optimistic cost of any refinement of HLA a is a valid optimistic cost for a’s current
optimistic  reachable  set,  and  when  pessimistic  descriptions  are  consistent,  the  max  such  pessimistic  cost  is 
similarly  valid. 

Proof. First consider optimistic valuations. Define v = (σ, c) to be the optimistic valuation just before the
HLA a being refined, and b_1, ..., b_n be the immediate refinements of a in this context. Let v_a be the valuation
resulting from doing a from v, and v_{b_1}, ..., v_{b_n} be the valuations after doing each b_i. (Each of these v's is a
simple valuation, and has a corresponding σ_i and c_i.)

Now, an optimistic simple valuation asserts that no states outside the reachable set are possibly reachable,
and  states  in  the  reachable  set  are  reachable  with  cost  no  less  than  c.  This  implies  that  no  states  in  the  exact 
reachable  set  are  reachable  with  cost  less  than  c.  Now,  recall  that  every  primitive  refinement  of  the  parent  plan 
is  a  primitive  refinement  of  at  least  one  of  its  immediate  refinements.  Thus,  the  cost  to  reach  each  actually 
reachable  state  is  lower  bounded  by  at  least  one  of  the  refinements’  optimistic  valuations.  Thus,  the  minimum 
cost  of  any  such  valuation  is  a  valid  optimistic  cost  for  the  original  reachable  set. 

For  pessimistic  simple  valuations,  things  are  a  little  more  complicated.  Recall  that  a  pessimistic  simple 
valuation  asserts  that  each  state  in  the  pessimistic  set  is  reachable  with  cost  at  most  c.  This  cost  statement  is 
not  equivalent  to  a  statement  about  the  exact  reachable  set  as  in  the  optimistic  simple  valuation  case.  Thus, 
for  things  to  work  we  must  require  consistent  pessimistic  descriptions  as  well. 

With the consistency requirement, we have that the union of the pessimistic reachable sets for the
refinements must be a superset of the pessimistic reachable set of the parent plan. Thus, every state claimed
to be reachable by the parent plan is claimed to be reachable by one of the refinements, with at most some
cost. Thus, the maximum cost of any refinement’s simple valuation is a valid pessimistic cost for the parent’s
current reachable set.⁷ □

Thus, upon refining an HLA edge, we can tighten its cost interval to reflect the cost intervals of its
immediate refinements, without modifying its reachable sets. This results in better cost bounds for all other
plans that pass through this HLA edge, without needing to do any additional progression computations for
(the suffixes of) such plans.⁸

7. A more accurate but expensive algorithm is possible for propagating pessimistic costs; see footnote 5 for details.
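
The interval update performed by upward propagation (the last three lines of RefinePlanEdge in Algorithm 1) can be written as a small Python function. The name propagate_upward and the tuple representation of cost intervals are assumptions of this sketch. As Theorem 5 states, the pessimistic half of the update is only justified when pessimistic descriptions are consistent.

```python
from typing import Iterable, Tuple

Interval = Tuple[float, float]  # (optimistic cost, pessimistic cost) of one edge


def propagate_upward(current: Interval, refinements: Iterable[Interval]) -> Interval:
    """Tighten an HLA edge's cost interval using its immediate refinements.

    o is the smallest optimistic cost and p the largest pessimistic cost
    among the refinements; the edge keeps whichever bound is tighter.
    """
    refs = list(refinements)
    if not refs:
        return current  # nothing was added, so nothing new is known
    o = min(lo for lo, _ in refs)
    p = max(hi for _, hi in refs)
    return (max(current[0], o), min(current[1], p))


print(propagate_upward((2.0, 9.0), [(3.0, 5.0), (4.0, 8.0)]))  # (3.0, 8.0)
print(propagate_upward((2.0, 4.0), [(1.0, 6.0)]))              # (2.0, 4.0)
```

In the second call the refinement carries looser bounds than the edge already has, so the interval is left unchanged; the update never loosens a bound.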

4.2  Angelic  Hierarchical  A* 

Our  first  offline  algorithm  is  Angelic  Hierarchical  A*  (AHA*),  a  hierarchically  optimal  planning  algorithm 
that  takes  advantage  of  the  semantic  guarantees  provided  by  optimistic  and  pessimistic  descriptions.  AHA* 
(see  Algorithm  2)  is  essentially  A*  in  refinement  space,  where  the  initial  node  is  the  plan  (Act),  possible 
“actions”  are  refinements  of  a  plan  at  some  HLA,  and  the  goal  set  consists  of  the  primitive  plans  that  reach  t 
from s_0. The algorithm repeatedly expands a node with smallest optimistic cost bound, until a goal node is
chosen  for  expansion,  which  is  returned  as  an  optimal  solution. 

More  concretely,  at  each  step  AHA*  selects  a  high-level  plan  a  with  minimal  optimistic  cost  to  t  (e.g., 
the  bottom  plan  in  Figure  2(b)).  Then  it  refines  a,  selecting  some  HLA  a  and  adding  to  the  ALT  all  plans 
obtained  from  a  by  replacing  a  with  one  of  its  immediate  refinements. 
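
The select-and-refine loop of AHA* can be illustrated with a toy Python sketch. Everything concrete below is an assumption made for the example: the travel hierarchy, the cost tables, and the simplification that optimistic costs are state-independent and additive over the actions of a plan. The sketch keeps a plain priority queue instead of an ALT, performs no pruning or upward propagation, and always refines the first HLA in the chosen plan.

```python
import heapq
from itertools import count
from typing import Dict, List, Optional, Tuple

Plan = Tuple[str, ...]

# Hypothetical toy hierarchy (not from the report).
PRIM_COST: Dict[str, float] = {"walk": 5.0, "bus": 2.0, "taxi": 4.0, "ticket": 1.0}
REFINEMENTS: Dict[str, List[Plan]] = {
    "Act": [("walk",), ("Ride",)],
    "Ride": [("ticket", "bus"), ("taxi",)],
}
# Assumed optimistic (lower-bound) costs for the HLAs.
OPT_COST: Dict[str, float] = {"Act": 2.0, "Ride": 3.0}


def optimistic_cost(plan: Plan) -> float:
    """Sum of exact primitive costs and optimistic HLA costs."""
    return sum(PRIM_COST[a] if a in PRIM_COST else OPT_COST[a] for a in plan)


def aha_star(initial: Plan = ("Act",)) -> Optional[Tuple[Plan, float]]:
    tie = count()  # tiebreaker so the heap never compares plans directly
    frontier = [(optimistic_cost(initial), next(tie), initial)]
    while frontier:
        cost, _, plan = heapq.heappop(frontier)
        hla_positions = [i for i, a in enumerate(plan) if a in REFINEMENTS]
        if not hla_positions:
            return plan, cost  # primitive plan chosen: its cost is exact
        i = hla_positions[0]   # any HLA may be refined; this sketch takes the first
        for ref in REFINEMENTS[plan[i]]:
            child = plan[:i] + ref + plan[i + 1:]
            heapq.heappush(frontier, (optimistic_cost(child), next(tie), child))
    return None


print(aha_star())  # (('ticket', 'bus'), 3.0)
```

Note that the plan ("walk",) is generated but never selected, since its exact cost of 5 exceeds the optimistic cost of the plans that lead to the returned solution.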

We  will  make  the  technical  assumption  that  for  every  c,  there  are  only  finitely  many  high-level  plans  with 
optimistic  cost  less  than  c.  This  is  essentially  a  positive-cost-cycle  condition  on  the  optimistic  costs,  and  is 
not  hard  to  ensure  in  practice.  Under  this  assumption,  we  have  the  following  theorem. 

Theorem  6.  AHA*  eventually  terminates,  and  returns  a  hierarchically  optimal  plan. 

Proof. We will show that at the beginning of each iteration of the loop, the lookahead tree contains a plan b
which can be refined to a hierarchically optimal primitive plan. This is certainly true at the first iteration.
By induction, suppose it is true at the k-th iteration. Now, if there exists such a plan that is not chosen for
refinement, then it will continue to be in the tree on the next iteration. So we only need to worry about the
case when there is a unique such plan, and it is chosen for refinement. By definition, no matter which action
in the plan is refined, at least one refinement will continue to be refinable to an optimal plan. By Theorem 4,
the first such refinement added to the tree will not be pruned.

In  particular,  the  invariant  above  holds  when  the  loop  terminates.  At  this  point,  the  returned  plan  has 
optimistic  cost  lower  than  all  other  plans  in  the  tree.  Since  its  own  optimistic  cost  is  exact  (as  it  is  primitive), 
it  in  fact  has  minimal  cost  among  all  refinements  of  plans  currently  in  the  tree,  and  is  therefore  hierarchically 
optimal. 

Finally,  by  assumption  on  the  optimistic  costs,  all  plans  whose  cost  is  at  most  the  optimal  cost  will 
eventually  be  considered,  including  the  hierarchically  optimal  one,  which  is  primitive.  Thus  the  algorithm 
eventually  terminates.  □ 

We  now  make  concrete  the  connection  between  AHA*  and  standard  A*  search.  AHA*  clearly  differs 
from  A*  over  the  state  space,  since  the  set  of  candidate  plans  and  expansion  operations  differ.  However,  it  is 
closely  related  to  A*  or  greedy  best-first  search  in  the  space  of  abstract  plans.  This  search  space  consists  of 
all  sequences  of  high-level  or  primitive  actions  together  with  a  dummy  terminal  state  t.  The  initial  state  is  the 
plan  Act.  Given  a  rule  for  choosing  which  HLA  of  a  given  plan  to  refine  next,  the  successors  of  a  nonprimitive 
plan  are  obtained  by  substituting  the  refinements  of  that  HLA  into  the  original  plan,  and  the  associated  cost  is 
0.  A  primitive  plan’s  only  successor  is  the  terminal  state,  and  this  move’s  cost  equals  the  cost  of  the  primitive 
plan.  The  heuristic  value  of  a  plan  is  its  optimistic  cost. 

Theorem  7.  If  the  optimistic  descriptions  are  consistent,  then  the  sequence  of  plans  refined  by  AHA*  (without 
upward  propagation)  is  a  subsequence  of  the  sequence  of  plans  expanded  by  A*  over  the  corresponding  plan 
space,  for  some  sequence  of  tiebreaking  choices. 

Proof. Let a_1, ..., a_k be the sequence of plans refined by AHA*, and S_t its set of unrefined plans at step
t. We show inductively the stronger statement that we can construct tiebreaking choices for A* such that
if b_1, ..., b_l denotes its sequence of expanded plans and S'_t denotes its open list at step t, there exist times
1 = t_1 < ... < t_k = l such that a_i = b_{t_i} and S_i ⊆ S'_{t_i}. The statement is clearly true for t_1. Consider step i
of AHA*, which corresponds to step t_i of A*. Following this step, the unrefined plans in the ALT are S_{i+1}.
At this point, the plan a_{i+1} is tied for the lowest cost in S_{i+1}. S'_{t_i + 1} contains this plan, and possibly other ones
with lower cost. However, none of those can be in S_{i+1}. We can therefore, by making appropriate tiebreaking
choices in A*, ensure that, if t_{i+1} is the next time at which A* expands a plan in S_{i+1}, b_{t_{i+1}} = a_{i+1}, and
furthermore, S_{i+1} ⊆ S'_{t_{i+1}}, completing the induction. □

8. Note that changing the costs renders the valuations stored at descendants of the refined edge out-of-date. The plan selection step of
AHA* can nevertheless be done correctly, by storing “Q-values” of each edge in the tree, and backing up Q-values up to the root
whenever upward propagation is done. With a little extra bookkeeping, upward propagation can even be carried out recursively:
updates to the cost of an HLA can result in better bounds for its parent HLA, and so on.

Algorithm 2: Angelic Hierarchical A*

function FINDOPTIMALPLAN(s_0, t)
    root ← MAKEINITIALALT(s_0, {(Act)})
    while ∃ an unrefined plan do
        a ← plan with min optimistic cost to t (tiebreak by pessimistic cost)
        if a is primitive then return a
        REFINEPLANEDGE(root, a, index of any HLA in a)
    return failure

While  AHA*  might  thus  seem  like  an  obvious  generalization  of  A*  to  the  hierarchical  setting,  we  believe 
that  it  is  an  important  contribution  for  several  reasons.  First,  its  effectiveness  hinges  on  our  ability  to  generate 
nontrivial  cost  bounds  for  high-level  sequences,  which  did  not  exist  previously.  Second,  it  derives  additional 
power  from  our  ALT  data  structures,  which  provide  caching,  pruning,  and  other  novel  improvements  specific 
to  the  hierarchical  setting. 

The  only  free  parameter  in  AHA*  is  the  choice  of  which  HLA  to  refine  in  a  given  plan;  our  implementation 
chooses  an  HLA  with  maximal  gap  between  its  optimistic  and  pessimistic  costs  (defined  below),  breaking 
ties  towards  higher-level  actions. 

Finally,  we  note  that  with  consistent  descriptions,  as  soon  as  AHA*  finds  an  optimal  high-level  plan  with 
equal optimistic and pessimistic costs, it will find an optimal primitive refinement very efficiently.
Consistency ensures that after each subsequent refinement, at least one of the resulting plans will also be optimal
with  equal  optimistic  and  pessimistic  costs;  moreover,  all  but  the  first  such  plan  will  be  pruned.  Further 
refinement  of  this  first  plan  will  continue  until  an  optimal  primitive  refinement  is  found  without  backtracking. 

4.3  Angelic  Hierarchical  Satisficing  Search 

This section presents an alternative algorithm, Angelic Hierarchical Satisficing Search (AHSS), which
attempts to find a plan that reaches the goal with at most some pre-specified cost α. AHSS can be much more
efficient than AHA*, since it can commit to a plan without first proving its optimality.

At each step, AHSS (see Algorithm 3) begins by checking if any primitive plans succeed with cost ≤ α.
If so, the best such plan is returned. Next, if any (high-level) plans succeed with pessimistic cost ≤ α, the
best such plan is committed to by discarding other potential plans. Finally, a plan with maximum priority
is refined at one of its HLAs. Priorities can be assigned arbitrarily; our implementation uses the negative
average of optimistic and pessimistic costs, to encourage a more depth-first search and favor plans with
smaller pessimistic cost.
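
The steps above can be sketched in Python on a toy problem. The hierarchy, the cost tables, and the state-independent additive cost intervals are all assumptions of this example, as are the function names; no ALT, pruning, or upward propagation is modeled. The priority function is the negative average of optimistic and pessimistic costs mentioned in the text.

```python
from typing import Dict, List, Optional, Tuple

Plan = Tuple[str, ...]

# Hypothetical toy hierarchy with (optimistic, pessimistic) bounds per HLA.
PRIM_COST: Dict[str, float] = {"walk": 5.0, "bus": 2.0, "taxi": 4.0, "ticket": 1.0}
REFINEMENTS: Dict[str, List[Plan]] = {
    "Act": [("walk",), ("Ride",)],
    "Ride": [("ticket", "bus"), ("taxi",)],
}
HLA_BOUNDS: Dict[str, Tuple[float, float]] = {"Act": (2.0, 5.0), "Ride": (3.0, 4.0)}


def bounds(plan: Plan) -> Tuple[float, float]:
    """Additive (optimistic, pessimistic) cost of a plan; exact for primitives."""
    lo = hi = 0.0
    for a in plan:
        o, p = (PRIM_COST[a], PRIM_COST[a]) if a in PRIM_COST else HLA_BOUNDS[a]
        lo, hi = lo + o, hi + p
    return lo, hi


def priority(plan: Plan) -> float:
    lo, hi = bounds(plan)
    return -(lo + hi) / 2.0


def is_primitive(plan: Plan) -> bool:
    return all(a in PRIM_COST for a in plan)


def ahss(alpha: float, initial: Plan = ("Act",)) -> Optional[Plan]:
    plans = [initial]  # unrefined candidate plans
    while True:
        candidates = [p for p in plans if bounds(p)[0] <= alpha]
        if not candidates:
            return None  # failure: no plan can possibly meet the threshold
        good = [p for p in plans if bounds(p)[1] <= alpha]
        if good:
            primitive = [p for p in good if is_primitive(p)]
            if primitive:
                return min(primitive, key=lambda p: bounds(p)[1])
            plans = [min(good, key=lambda p: bounds(p)[1])]  # commit to one plan
            candidates = plans
        plan = max(candidates, key=priority)
        i = next(k for k, a in enumerate(plan) if a in REFINEMENTS)
        plans.remove(plan)
        plans += [plan[:i] + r + plan[i + 1:] for r in REFINEMENTS[plan[i]]]


print(ahss(4.0))  # ('ticket', 'bus')
print(ahss(5.0))  # ('walk',)
print(ahss(2.5))  # None
```

With threshold 5 the sketch returns the walking plan as soon as it appears, even though a cheaper plan exists, which illustrates how a satisficing search can stop without proving optimality.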

Theorem 8. If there exist primitive plans consistent with the hierarchy, with cost ≤ α, AHSS eventually
returns one of them. Otherwise, it eventually returns failure.

Proof. The algorithm eventually terminates since there are only finitely many plans with optimistic cost ≤ α.
Since optimistic costs are exact for primitive plans, it will never falsely report success. Suppose there do exist
primitive plans with cost ≤ α. It suffices to show that at the beginning of each iteration, the tree contains
a plan one of whose primitive refinements has cost ≤ α. The invariant holds by assumption at the first
iteration. Suppose it is true at the beginning of the k-th iteration. It will continue to hold after the if-statement,
by definition of pessimistic costs. We only need to consider the case when there is a single such plan in the
tree after the if-statement, and it is selected for refinement. Regardless of which action in the plan is refined,
one of the refinements will also have a primitive refinement with cost ≤ α. At least one such refinement will
not be pruned. Thus the invariant holds at the next iteration. □

Algorithm 3: Angelic Hierarchical Satisficing Search

function FINDSATISFICINGPLAN(s_0, t, α)
    root ← MAKEINITIALALT(s_0, {(Act)})
    while ∃ an unrefined plan with optimistic cost ≤ α to t do
        if any plan has pessimistic cost ≤ α to t then
            if any such plans are primitive then return a best one
            else delete all plans other than one with min pessimistic cost
        a ← a plan with optimistic cost ≤ α to t with max priority
        REFINEPLANEDGE(root, a, index of any HLA in a)
    return failure

5.  Online  Search  Algorithms 

In  the  online  setting,  an  agent  must  begin  executing  actions  without  first  searching  all  the  way  to  the  goal. 
The agent begins in the initial state s_0, performs a fixed amount of computation, then selects an action a.⁹ It
then does this action in the environment, moving to state T(s_0, a) and paying cost g(s_0, a). This continues
until  the  goal  state  t  is  reached.  Performance  is  measured  by  the  total  cost  of  the  actions  executed.  We  assume 
that  the  state  space  is  safely  explorable,  so  that  the  goal  is  reachable  from  any  state  (with  finite  cost),  and  also 
assume  positive  action  costs  and  consistent  heuristics/descriptions  from  this  point  forward. 

This  section  presents  our  next  contribution,  one  of  the  first  hierarchical  lookahead  algorithms.  Since  it 
will  build  upon  a  variant  of  Korf’s  (1990)  Learning  Real-Time  A*  (LRTA*)  algorithm,  we  begin  by  briefly 
reviewing LRTA*.¹⁰

At  each  environment  step,  LRTA*  uses  its  computation  time  to  build  a  lookahead  tree  consisting  of  all 
plans a whose cost g(s_0, a) just exceeds a given threshold. Then, it selects one such plan a_min with minimal
f-cost and does its first action in the world. Intuitively, looking farther ahead should increase the likelihood
that a_min is actually good, by decreasing reliance on the (error-prone) heuristic. The choice of candidate
plans is designed to compensate for the fact that the heuristic h is typically biased (i.e., admissible) whereas
g is exact, and thus the f-cost of a plan with higher h and lower g may not be directly comparable to one with
higher  g  and  lower  h. 

This core algorithm is then improved by a learning rule. Whenever a partial plan a leading to a previously
visited state s is encountered during search, further extensions of a are not considered; instead, the remaining
cost-to-goal  from  s  is  taken  to  be  the  value  computed  by  the  most  recent  search  at  s.  This  augmented 
algorithm  has  several  nice  properties: 

Theorem  9.  (Korf,  1990)  If  g-costs  are  positive,  h-costs  are  finite,  and  the  state  space  is  finite  and  safely 
explorable,  then  LRTA*  will  eventually  reach  the  goal. 

Theorem  10.  (Korf,  1990)  If,  in  addition,  h  is  admissible  and  ties  are  broken  randomly,  then  given  enough 
runs,  LRTA*  will  eventually  learn  the  true  cost  of  every  state  on  an  optimal  path,  and  act  optimally  thereafter. 

However, as described thus far, LRTA* has several drawbacks. First, it wastes time considering obviously
bad plans. (Korf prevented this with “alpha pruning”.) Second, a cost threshold must be set in advance, and
picking this threshold so that the algorithm uses a desired amount of computation time may be difficult. Both
drawbacks can be solved using the following adaptive LRTA* algorithm, a relative of Korf’s “time-limited A*”:
(1) Start with the empty plan. (2) At each step, select an unexpanded plan with lowest f-cost. If this plan has
greater g-cost than any previously expanded plan, “lock it in” as the current return value. Expand this plan.
(3) When computation time runs out, return the current “locked-in” plan.

9. More interesting ways to balance real-world and computational cost are possible, but this suffices for now.

10. To be precise, Korf focused on the case of unit action costs; we present the natural generalization to positive real-valued costs.

Theorem 11. At any point during the operation of this algorithm, let a be the current locked-in plan, c_2 be its
corresponding “record-setting” g-cost, and c_1 be the previous record g-cost (c_1 < c_2). Given any threshold
in [c_1, c_2), LRTA* would choose a for execution (up to tiebreaking).

Proof. First, note that given any threshold c ∈ [c_1, c_2), LRTA* would definitely have constructed and
expanded all of the ancestors of a. Consider any ancestor of a. By consistency and positive action costs, it
must have g-cost and f-cost no greater than those of a. Because a was “record-setting”, its g-cost must actually be strictly smaller.
Now, suppose that this ancestor was not expanded. Then, its g-cost must be > c. But c_1 ≤ c was the previous
record-setting cost, so we have a contradiction. Thus, LRTA* would have generated but not expanded a.

Now, suppose that LRTA* with threshold c chooses some other plan over a for execution. This plan must
have cost > c to be present and unexpanded, and f-cost < that of a to be selected. But if this were the case,
this plan would have been selected for expansion by the adaptive algorithm before a, and would have been
the previous record-setting plan. But its cost is > c ≥ c_1, the cost of the previous record-setting plan, a
contradiction. □

Thus,  this  modified  algorithm  can  be  used  as  an  efficient,  anytime  version  of  LRTA*.  Since  its  behavior 
reduces  to  the  original  version  for  a  particular  (adaptive)  choice  of  cost  thresholds,  all  of  the  properties  of 
LRTA*  hold  for  it  as  well. 
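
The lock-in rule of the adaptive algorithm can be sketched in Python as follows. The toy graph, the heuristic table, the expansion budget standing in for computation time, and the early return when a goal plan is selected are all assumptions of this sketch; the learning rule is omitted.

```python
import heapq
from itertools import count
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Edges = Dict[State, List[Tuple[str, State, float]]]


def adaptive_lookahead(start: State, edges: Edges, h: Callable[[State], float],
                       goal: State, max_expansions: int) -> Tuple[str, ...]:
    """Best-first lookahead that returns the current locked-in plan."""
    tie = count()  # tiebreaker so the heap never compares plans or states
    frontier = [(h(start), 0.0, next(tie), (), start)]  # (f, g, tie, plan, state)
    record_g = -1.0
    locked_in: Tuple[str, ...] = ()
    for _ in range(max_expansions):
        if not frontier:
            break
        _f, g, _, plan, s = heapq.heappop(frontier)  # unexpanded plan with lowest f-cost
        if g > record_g:          # new record g-cost: lock this plan in
            record_g, locked_in = g, plan
        if s == goal:             # assumption of this sketch: stop at a goal plan
            return plan
        for action, s2, cost in edges.get(s, []):
            g2 = g + cost
            heapq.heappush(frontier, (g2 + h(s2), g2, next(tie), plan + (action,), s2))
    return locked_in


edges: Edges = {
    "A": [("x", "B", 1.0), ("y", "C", 2.0)],
    "B": [("z", "G", 5.0)],
    "C": [("w", "G", 1.0)],
    "G": [],
}
h_table = {"A": 2.0, "B": 1.0, "C": 1.0, "G": 0.0}  # consistent on this graph

print(adaptive_lookahead("A", edges, h_table.__getitem__, "G", 2))   # ('x',)
print(adaptive_lookahead("A", edges, h_table.__getitem__, "G", 3))   # ('y',)
print(adaptive_lookahead("A", edges, h_table.__getitem__, "G", 10))  # ('y', 'w')
```

With a budget of two expansions the sketch locks in the plan beginning with x, whose continuation turns out to be expensive; one more expansion is enough to switch to y, illustrating the anytime character of the rule.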

5.1  Angelic  Hierarchical  Learning  Real-Time  A* 

This  section  describes  Angelic  Hierarchical  Learning  Real-Time  A*  (AHLRTA*,  see  Algorithm  4),  which 
bears  (roughly)  the  same  relation  to  adaptive  LRTA*  as  AHA*  does  to  A*.  Because  a  single  HLA  can 
correspond  to  many  primitive  actions,  for  a  given  amount  of  computation  time  we  hope  that  AHLRTA*  will 
have  a  greater  effective  lookahead  depth  than  LRTA*,  and  thus  make  better  action  choices.  However,  a 
number  of  issues  arise  in  the  generalization  to  the  hierarchical  setting  that  must  be  addressed  to  make  this 
basic  idea  work  in  both  theory  and  practice. 

First,  while  AHLRTA*  searches  over  the  space  of  high-level  plans,  when  computation  time  runs  out  it 
must  choose  a  primitive  action  to  execute.  Thus,  if  the  algorithm  initializes  its  ALT  with  the  single  plan  (Act), 
it  will  have  to  consider  its  refinements  carefully  to  ensure  that  in  its  final  ALT,  at  least  one  of  the  (hopefully 
better)  high-level  plans  begins  with  an  executable  primitive.  To  avoid  this  issue  (and  to  ensure  convergence 
of  costs,  as  described  below),  we  instead  choose  to  initialize  the  ALT  with  the  set  of  all  plans  consisting  of 
a primitive action followed by Act.¹¹ With this set of plans, the choice of which HLA to refine in a plan is
open;  our  implementation  uses  the  policy  described  above  for  AHA*. 

Second, as we saw earlier, an analogue of f-cost can be extracted from our optimistic valuations.
However, there is no obvious breakdown of f into g and h components, since a high-level plan can consist of
actions at various levels, each of whose descriptions may make different types and degrees of characteristic
errors. For now, we assume that a set of higher-level HLAs (e.g., Act and Go) has been identified, let h be the
sum of the optimistic costs of these actions, and let g = f − h be the cost of the primitives and remaining
HLAs.
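
This breakdown can be illustrated with a short Python sketch. It assumes simple valuations, so that a plan's optimistic cost is a sum of per-action optimistic costs; the function name split_f, the action names, and the cost values are hypothetical.

```python
from typing import Dict, Sequence, Set, Tuple


def split_f(plan: Sequence[str], optimistic_cost: Dict[str, float],
            heuristic_hlas: Set[str]) -> Tuple[float, float]:
    """Return (g, h) for a high-level plan.

    h sums the optimistic costs of the designated higher-level HLAs,
    and g = f - h covers the primitives and the remaining HLAs.
    """
    f = sum(optimistic_cost[a] for a in plan)
    h = sum(optimistic_cost[a] for a in plan if a in heuristic_hlas)
    return f - h, h


costs = {"L": 1.0, "Nav": 3.0, "Go": 2.0, "Act": 4.0}  # hypothetical optimistic costs
g, h = split_f(("L", "Nav", "Go", "Act"), costs, {"Go", "Act"})
print(g, h)  # 4.0 6.0
```

Here the primitive L and the lower-level HLA Nav contribute to g, while Go and Act are treated as heuristic estimates.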

Finally, whereas the outcome of a primitive plan is a particular concrete state whose stored cost can be
simply looked up in a hash table, the optimistic valuations of a high-level plan instead provide a sequence
of reachable sets of states. In general, for each such set we could look up and combine the stored costs of
its elements; instead, however, for efficiency our implementation only checks for stored costs of singleton
optimistic sets (e.g., those corresponding to a primitive prefix of a given high-level plan). If the state in
a constructed singleton set has a stored cost, progression is stopped and this value is used as the cost of
the remainder of the plan. This functionality is added by modifying RefinePlanEdge and AddPlan
accordingly (not shown).

11. Note that with this choice, the plans considered by the agent may not be valid hierarchical plans (i.e., refinements of Act). However,
since the agent can change its mind on each world step, the actual sequence of actions executed in the world is not in general
consistent with the hierarchy anyway.

Algorithm 4: Angelic Hierarchical Learning Real-Time A*

function HIERARCHICALLOOKAHEADAGENT(s_0, t)
    memory ← an empty hash table
    while s_0 ≠ t do
        root ← MAKEINITIALALT(s_0, {(a, Act) | a ∈ L})
        (g, a, f) ← (−1, nil, 0)
        while ∃ unrefined plans from root ∧ time remains do
            a ← a plan w/ min f-cost
            if the g-cost of a > g then
                (g, a, f) ← (g-cost of a, a_1, f-cost of a)
            REFINEPLANEDGE(root, a, some index, memory)
        do a in the world
        memory[s_0] ← f
        s_0 ← T(s_0, a)

Given  all  of  these  choices,  we  have  the  following: 

Theorem 12. AHLRTA* reduces to adaptive LRTA*, given a “flat” hierarchy in which Act refines to any
primitive  action  followed  by  Act  (or  the  empty  sequence). 

Proof.  Trivial;  simply  note  that  refining  a  plan  in  the  “flat”  hierarchy  is  the  same  as  expanding  a  plan  in  the 
primitive  LRTA*  setting.  □ 

(In fact, this is how we have implemented LRTA* for our experiments.) Moreover, the desirable properties
of LRTA* also hold for AHLRTA* in general hierarchies. This follows because AHLRTA* behaves
identically to LRTA* in neighborhoods in which every state has been visited at least once.
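To make the flat special case concrete, here is a minimal sketch of plain LRTA* with depth-1 lookahead, using the same learning rule as Algorithm 4 (memory[s0] ← f). It is an illustration under stated assumptions, not the authors' implementation: the 4x4 grid, the wall layout, and all names (lrta_star_trial, grid_successors, manhattan) are hypothetical, and ties are broken deterministically rather than randomly.

```python
from typing import Callable, Dict, Iterable, Tuple

State = Tuple[int, int]  # (x, y) cell in a hypothetical toy grid

WIDTH, HEIGHT = 4, 4
WALLS = {(1, 0), (1, 1), (1, 2)}  # a wall that creates a heuristic local minimum
START, GOAL = (0, 0), (2, 0)


def grid_successors(state: State) -> Iterable[Tuple[State, float]]:
    """Yield (next state, cost) for unit-cost moves in the four cardinal directions."""
    x, y = state
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT and nxt not in WALLS:
            yield nxt, 1.0


def manhattan(state: State) -> float:
    """Admissible Manhattan-distance heuristic to GOAL."""
    return float(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))


def lrta_star_trial(
    start: State,
    goal: State,
    successors: Callable[[State], Iterable[Tuple[State, float]]],
    heuristic: Callable[[State], float],
    memory: Dict[State, float],
    max_steps: int = 10_000,
) -> float:
    """Run one trial of depth-1 LRTA*; return the total cost paid to reach goal."""
    state, total = start, 0.0
    for _ in range(max_steps):
        if state == goal:
            return total
        # f-cost of each one-step plan: action cost plus the learned
        # (or, if never stored, heuristic) cost of the resulting state.
        f, cost, nxt = min(
            (c + memory.get(n, heuristic(n)), c, n) for n, c in successors(state)
        )
        memory[state] = f  # learning rule: memory[s0] <- f
        total += cost
        state = nxt  # s0 <- T(s0, a)
    raise RuntimeError("step limit reached before the goal")


if __name__ == "__main__":
    learned: Dict[State, float] = {}
    for trial in range(1, 6):
        cost = lrta_star_trial(START, GOAL, grid_successors, manhattan, learned)
        print(f"trial {trial}: cost {cost}")
```

Because the table persists across trials, early trials spend extra steps raising the stored costs in the pocket behind the wall, and later trials follow a shortest path once those values stop changing.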

Theorem 13. If primitive g-costs are positive, f-costs are finite, and the state space is finite and safely
explorable, then AHLRTA* will eventually reach the goal.

Proof.  Simply  note  that  AHLRTA*  is  actually  equivalent  to  LRTA*  with  depth  1,  where  the  “heuristic”  is 
computed  by  a  limited  hierarchical  search  from  each  next  state  reachable  by  some  primitive  action.  □ 

Theorem 14. If, in addition, f-costs are admissible, ties are broken randomly, and the hierarchy is
optimality-preserving, then over repeated trials AHLRTA* will eventually learn the true cost of every state on an optimal
path and act optimally thereafter.

Proof.  Same  as  previous  theorem.  □ 

If f-costs are inadmissible or the hierarchy is not optimality-preserving, the theorem still holds if s0 is
sampled from a distribution with support on S in each trial.

Our  implementation  of  AHLRTA*  includes  two  minor  changes  from  the  version  described  above,  which 
we  have  found  to  increase  its  effectiveness.  First,  it  sometimes  throws  away  some  of  its  allowed  computation 
time,  so  that  the  number  of  refinements  taken  per  allowed  initial  primitive  action  is  constant;  this  tends  to 
improve  the  interaction  of  the  lookahead  strategy  with  the  learning  rule.  Second,  when  deciding  when  to 
“lock  in”  a  plan  it  requires  additionally  that  the  plan  is  more  refined  than  the  previous  locked  in  plan;  this 
helps  counteract  the  implicit  bias  towards  higher-level  plans  caused  by  aggregation  of  costs  from  primitives 
and  various  HLAs  into  g-cost.  Since  both  changes  effectively  only  change  the  stopping  time  of  the  algorithm, 
its  desirable  properties  are  preserved. 
6.  Experiments

This section describes results for the above algorithms on two domains: our “nav-switch” running example,
and the warehouse world (MRW ’07).12

The  warehouse  world  is  an  elaboration  of  the  well-known  blocks  world,  with  discrete  spatial  constraints 
added.  In  this  domain,  a  forklift-like  gripper  hanging  from  the  ceiling  can  move  around  and  manipulate 
blocks  stacked  on  a  table.  Both  gripper  and  blocks  occupy  single  squares  in  a  2-d  grid  of  allowed  positions. 
The  gripper  can  move  to  free  squares  in  the  four  cardinal  directions,  turn  (to  face  the  other  way)  when  in  the 
top  row,  and  pick  up  and  put  down  blocks  from  either  side.  Each  primitive  action  has  unit  cost.  Because 
of the limited maneuvering space, warehouse world problems can be rather difficult. For instance, Figure 3
shows a problem that cannot be solved in fewer than 50 primitive steps. The figure also shows our HLAs for
the  domain,  which  we  use  unchanged  from  (MRW  ’07)  along  with  the  NCSTRIPS  descriptions  therein  (to 
which  we  add  simple  cost  bounds).  We  consider  six  instances  of  varying  difficulty. 

For  the  nav-switch  domain,  we  consider  square  grids  of  varying  size  with  3  randomly  placed  switches, 
where  the  goal  is  always  to  navigate  from  one  corner  to  the  other.  We  use  the  hierarchy  and  descriptions 
described  above. 

We  first  present  results  for  our  offline  algorithms  on  these  domains  (see  Table  1).  On  the  warehouse  world 
instances,  nonhierarchical  (flat)  A*  does  reasonably  well  on  small  problems,  but  quickly  becomes  impractical 
as  the  optimal  plan  length  increases.  AHA*  is  able  to  plan  optimally  in  larger  problems,  but  for  the  largest 
instances,  it  too  runs  out  of  time.  The  reason  is  that  it  must  not  only  find  the  optimal  plan,  but  also  prove  that 
all other high-level plans have higher cost. In contrast, AHSS with a threshold of ∞ is able to solve all the
problems fairly quickly.

We also included, for comparison, results for the Hierarchical Forward Search (HFS) algorithm (MRW
’07), which does not consider plan cost. When passed a threshold of ∞, AHSS has the same objective
as HFS: to find any plan from s0 to t with as little computation as possible. However, AHSS has several
important advantages over HFS. First, its priority function serves as a heuristic, and usually results in higher-quality
plans being found. Second, AHSS is actually much simpler. In particular, whereas HFS required
iterative  deepening,  cycle  checking,  and  a  special  plan  decomposition  mechanism  to  ensure  completeness  and 
efficiency,  the  use  of  cost  information  allows  AHSS  to  naturally  reap  the  same  benefits  without  needing  any 
such  explicit  mechanisms.  Finally,  the  abstract  lookahead  tree  data  structure  provides  caching  and  decreases 
the  number  of  NCSTRIPS  progressions  required.  Due  to  these  improvements,  HFS  is  slightly  slower  than 
the  optimal  planner  AHA*,  and  a  few  orders  of  magnitude  slower  than  AHSS. 

On  the  nav-switch  instances,  results  are  qualitatively  similar.  Again,  flat  A*  quickly  becomes  impractical 
as  the  problem  size  grows.  However,  in  this  domain,  AHA*  actually  performs  very  well,  almost  matching  the 
performance  of  AHSS.  The  reason  is  that  in  this  domain,  the  descriptions  for  Nav  are  exact,  and  thus  AHA* 
can  very  quickly  find  a  provably  optimal  high-level  plan  and  refine  it  down  to  the  primitive  level  without 
backtracking,  as  described  earlier. 

The  obvious  next  step  would  be  to  compare  AHA*  with  other  optimal  hierarchical  planners,  such  as 
SHOP2  on  its  “optimal”  setting.  However,  this  is  far  from  straightforward,  for  several  reasons.  First,  useful 
hierarchies  are  often  not  optimality-preserving,  and  it  is  not  at  all  obvious  how  we  should  compare  different 
“optimal”  planners  that  use  different  standards  for  optimality.  Second,  as  described  in  the  related  work  section 
below,  the  type  and  amount  of  problem-specific  information  provided  to  our  algorithms  can  be  very  different 
than  for  HTN  planners  such  as  SHOP2.  We  have  yet  to  find  a  way  to  perform  meaningful  experimental 
comparisons  under  these  circumstances. 

For the online setting, we compared (flat) LRTA* and AHLRTA*. The performance of an online algorithm
on a given instance depends on the number of allowed refinements per step. Our graphs therefore plot total
cost against refinements per step for LRTA* and AHLRTA*. AHLRTA* took about five times longer per
refinement than LRTA* on average, though this factor could probably be decreased by optimizing the DNF
operations.13

12. Our code and data are available at http://www.cs.berkeley.edu/~jawolfe/angelic/

13.  It  cannot  be  completely  avoided  because  refinements  for  the  hierarchical  algorithms  require  multiple  progressions. 
Figure 3: Left: A 4x4 warehouse world problem with goal ON(c, t2) ∧ ON(a, c). Right: HLAs for warehouse
world domain.
Table 1: Run-times of offline algorithms, rounded to the nearest second, on some nav-switch and warehouse
world problem instances. The algorithms are (flat) graph A*, AHA*, AHSS with threshold α = ∞,
and HFS from (MRW ’07). Algorithms were terminated if they failed to return within 10^4 seconds
(shown as “-”).


The  left  graph  of  Figure  4  is  averaged  across  three  instances  of  the  nav-switch  world.  This  domain  is 
relatively  easy  as  an  online  lookahead  problem,  because  the  Manhattan-distance  heuristic  for  Act  always 
points in roughly the right direction. In all cases, the hierarchical agent behaved optimally given about 50
refinements per step. With this number of refinements, the flat agent usually followed a reasonable, though
suboptimal, plan, but it did not display optimal behavior even when the number of refinements per step was
increased to 1000.

The  right  graph  in  Figure  4  shows  results  averaged  across  three  instances  of  the  warehouse  world.  This 
domain  is  more  challenging  for  online  lookahead,  as  the  combinatorial  structure  of  the  problem  makes  the  Act 
heuristic  less  reliable.  AHLRTA*  started  to  behave  optimally  given  a  few  hundred  refinements  per  step.  In 
contrast,  flat  lookahead  was  very  suboptimal  (note  that  the  y-axis  is  on  a  log  scale),  even  given  five  thousand 
refinements. 

Here  are  some  qualitative  phenomena  we  observed  on  the  experiments  (data  available  at  paper  website). 
First,  as  the  number  of  refinements  increased,  AHLRTA*  reached  a  point  where  it  found  a  provably  optimal 
primitive  plan  on  each  environment  step.  But  it  also  had  reasonable  behavior  when  the  number  of  refinements 
did  not  suffice  to  find  a  provably  optimal  plan  (the  left  portion  of  the  right-hand  graph),  in  that  the  “intended” 
plan  at  each  step  typically  consisted  of  a  few  primitive  actions  followed  by  increasingly  high-level  actions, 
and  this  intended  plan  was  usually  reasonable  at  the  high  level.  Second,  when  very  few  refinements  (<  50) 
were  allowed  per  step,  AHLRTA*  actually  did  worse  than  LRTA*  on  (a  single  instance  of)  the  nav-switch 
world.  While  we  do  not  completely  understand  the  cause,  what  seems  to  be  happening  is  that  in  the  regime  of 
very  little  deliberation  time  per  step,  lookahead  pathologies  and  the  LRTA*  learning  rule  interact  in  complex 
ways,  often  causing  the  agent  to  spend  long  periods  of  time  “filling  out”  local  minima  of  the  heuristic  function 
in  the  state  space.14  This  phenomenon  is  further  complicated  in  the  hierarchical  case  by  the  fact  that  the  cost 
bounds  for  different  HLAs  tend  to  be  systematically  biased  in  different  ways  (for  example,  the  optimistic 
bound for Nav is nearly exact, while the bound for Move tends to underestimate by a factor of two). Improved
online lookahead algorithms that degrade gracefully in such situations, even given very little deliberation
time, are an interesting topic for future work.

14. This is also why the LRTA* curve in the warehouse world is non-monotonic.

Figure 4: Total cost-to-goal for online algorithms as a function of the number of allowed refinements per environment
step, averaged over three instances each of the nav-switch domain (left) and warehouse
world (right). (Warehouse world costs shown in log-scale.)

7.  Related  Work 

We  briefly  describe  work  related  to  our  specific  contributions,  deferring  to  (MRW  ’07)  for  discussion  of 
relationships  between  this  general  line  of  work  and  previous  approaches. 

Most  previous  work  in  hierarchical  planning  (Tate,  1977;  Yang,  1990;  Russell  and  Norvig,  2003)  has 
viewed  HLA  descriptions  (when  used  at  all)  as  constraints  on  the  planning  process  (e.g.,  “only  consider 
refinements  that  achieve  p”),  rather  than  as  making  true  assertions  about  the  effects  of  HLAs.  Such  HTN 
planning systems, e.g., SHOP2 (Nau et al., 2003), have achieved impressive results in previous planning
competitions  and  real-world  domains — despite  the  fact  that  they  cannot  assure  the  correctness  or  bound  the 
cost  of  abstract  plans.  Instead,  they  encode  a  good  deal  of  domain-specific  advice  on  which  refinements  to 
try  in  which  circumstances,  often  expressed  as  arbitrary  program  code.  For  fairly  simple  domains  described 
in  tens  of  lines  of  PDDL,  SHOP2  hierarchies  can  include  hundreds  or  thousands  of  lines  of  Lisp  code.  In 
contrast,  our  algorithms  only  require  a  (typically  simple)  hierarchical  structure,  along  with  descriptions  that 
logically  follow  from  (and  are  potentially  automatically  derivable  from)  this  structure. 

The  closest  work  to  ours  is  by  Doan  and  Haddawy  (1995).  Their  DRIPS  planning  system  uses  action 
abstraction  along  with  an  analogue  of  our  optimistic  descriptions  to  find  optimal  plans  in  the  probabilistic 
setting.  However,  without  pessimistic  descriptions,  they  can  only  prove  that  a  given  high-level  plan  satisfies 
some  property  when  the  property  holds  for  all  of  its  refinements,  which  severely  limits  the  amount  of  pruning 
possible  compared  to  our  approach.  Helwig  and  Haddawy  (1996)  extended  DRIPS  to  the  online  setting.  Their 
algorithm  did  not  cache  backed-up  values,  and  hence  cannot  guarantee  eventual  goal  achievement,  but  it  was 
probably  the  first  principled  online  hierarchical  lookahead  agent. 

Several  other  works  have  pursued  similar  goals  to  ours,  but  using  state  abstraction  rather  than  HLAs. 
Holte  et  al.  (1996)  developed  Hierarchical  A*,  which  uses  an  automatically  constructed  hierarchy  of  state 
abstractions  in  which  the  results  of  optimal  search  at  each  level  define  an  admissible  heuristic  for  search  at 
the  next-lower  level.  Similarly,  Bulitko  et  al.  (2007)  proposed  the  PR  LRTS  algorithm,  a  real-time  algorithm 
in  which  a  plan  discovered  at  each  level  constrains  the  planning  process  at  the  next-lower  level. 

Finally,  other  works  have  considered  adding  pessimistic  bounds  to  the  A*  (Berliner,  1979)  and  LRTA* 
(Ishida  and  Shimbo,  1996)  algorithms,  to  help  guide  search  and  exploration  as  well  as  monitor  convergence. 
These  techniques  may  also  be  useful  for  our  corresponding  hierarchical  algorithms. 
8.  Discussion


We  have  presented  several  new  algorithms  for  hierarchical  planning  with  promising  theoretical  and  empirical 
properties.  There  are  many  interesting  directions  for  future  work,  such  as  developing  better  representations 
for  descriptions  and  valuations,  automatically  synthesizing  descriptions  from  the  hierarchy,  and  generalizing 
domain-independent  techniques  for  automatic  derivation  of  planning  heuristics  to  the  hierarchical  setting. 
One  might  also  consider  extensions  to  partially  ordered,  probabilistic,  and  partially  observable  settings,  and 
better  online  algorithms  that,  e.g.,  maintain  more  state  across  environment  steps. 